This course is just what I need now.
We are writing a couple of articles in psychoacoustics and need to use GAM and PCA analyses. This course helps me go deeper into R programming and also gives me a better understanding of what my colleague is doing with GAMs (Generalized Additive Models). The earlier MOOC courses in statistics have been interesting and useful, and this is an excellent continuation.
I have a new GitHub repository dedicated to this project:
https://github.com/jussijaatinen/IODS-project
The data was collected with a survey about teaching and learning after the course “Introduction to statistics for social science”, fall 2014. The most essential aspects of studying (deep, surface and strategic learning) have been named and combined so that their connection to the intended dimensions can be seen.
There were twice as many females as males among the respondents. The dataset has 166 observations and 7 variables.
gender Age Attitude deep stra
F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
Median :22.00 Median :3.200 Median :3.667 Median :3.188
Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
surf Points
Min. :1.583 Min. : 7.00
1st Qu.:2.417 1st Qu.:19.00
Median :2.833 Median :23.00
Mean :2.787 Mean :22.72
3rd Qu.:3.167 3rd Qu.:27.75
Max. :4.333 Max. :33.00
'data.frame': 166 obs. of 7 variables:
$ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
$ Age : int 53 55 49 53 49 38 50 37 37 42 ...
$ Attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
$ deep : num 3.58 2.92 3.5 3.5 3.67 ...
$ stra : num 3.38 2.75 3.62 3.12 3.62 ...
$ surf : num 2.58 3.17 2.25 2.25 2.83 ...
$ Points : int 25 12 24 10 22 21 21 31 24 26 ...
The three variables with the largest correlations with the dependent variable Points were chosen for the model. The independent variables were Attitude (r = 0.44), strategic learning (r = 0.15) and surface learning (r = -0.14). The genders were combined in the correlation test.
To test which of the chosen independent variables contributed significantly, several alternative regression models were evaluated to find the best one.
Call:
lm(formula = learning2014$Points ~ learning2014$Attitude + learning2014$stra +
learning2014$surf, data = learning2014)
Residuals:
Min 1Q Median 3Q Max
-17.1550 -3.4346 0.5156 3.6401 10.8952
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.0171 3.6837 2.991 0.00322 **
learning2014$Attitude 3.3952 0.5741 5.913 1.93e-08 ***
learning2014$stra 0.8531 0.5416 1.575 0.11716
learning2014$surf -0.5861 0.8014 -0.731 0.46563
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.296 on 162 degrees of freedom
Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Call:
lm(formula = learning2014$Points ~ learning2014$Attitude + learning2014$stra,
data = learning2014)
Residuals:
Min 1Q Median 3Q Max
-17.6436 -3.3113 0.5575 3.7928 10.9295
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.9729 2.3959 3.745 0.00025 ***
learning2014$Attitude 3.4658 0.5652 6.132 6.31e-09 ***
learning2014$stra 0.9137 0.5345 1.709 0.08927 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.289 on 163 degrees of freedom
Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Call:
lm(formula = learning2014$Points ~ learning2014$Attitude + learning2014$surf,
data = learning2014)
Residuals:
Min 1Q Median 3Q Max
-17.277 -3.236 0.386 3.977 10.642
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.1196 3.1271 4.515 1.21e-05 ***
learning2014$Attitude 3.4264 0.5764 5.944 1.63e-08 ***
learning2014$surf -0.7790 0.7956 -0.979 0.329
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.32 on 163 degrees of freedom
Multiple R-squared: 0.1953, Adjusted R-squared: 0.1854
F-statistic: 19.78 on 2 and 163 DF, p-value: 2.041e-08
Call:
lm(formula = learning2014$Points ~ learning2014$Attitude + learning2014$stra +
learning2014$gender, data = learning2014)
Residuals:
Min 1Q Median 3Q Max
-17.7179 -3.3285 0.5343 3.7412 10.9007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.9798 2.4030 3.737 0.000258 ***
learning2014$Attitude 3.5100 0.5956 5.893 2.13e-08 ***
learning2014$stra 0.8911 0.5441 1.638 0.103419
learning2014$genderM -0.2236 0.9248 -0.242 0.809231
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.304 on 162 degrees of freedom
Multiple R-squared: 0.2051, Adjusted R-squared: 0.1904
F-statistic: 13.93 on 3 and 162 DF, p-value: 3.982e-08
Call:
lm(formula = learning2014$Points ~ learning2014$Attitude, data = learning2014)
Residuals:
Min 1Q Median 3Q Max
-16.9763 -3.2119 0.4339 4.1534 10.6645
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
learning2014$Attitude 3.5255 0.5674 6.214 4.12e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.32 on 164 degrees of freedom
Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
As seen above, only Attitude was statistically significant among the chosen variables.
Comparing the R-squared (R²) values between the models shows that Attitude alone explains almost as much of the variance as the combination of all three tested variables. Still, it explains only a fraction (< 20%) of the total variance in this data.
To test the validity of the model and the assumption of normally distributed residuals, the following diagnostic plots were examined: Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage. The Residuals vs Fitted plot shows that the residuals are spread reasonably evenly around zero. The Normal Q-Q plot shows that the residuals follow the reference line well enough (except at the far ends). The Residuals vs Leverage plot shows no outliers in isolated positions far from the rest. Therefore, the assumptions of linearity and normally distributed errors can be accepted.
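These diagnostics come from base R's `plot()` method for `lm` objects. The sketch below uses simulated stand-in data (variable names as in `learning2014`, coefficients loosely matching the fitted model) so that it runs on its own; the actual analysis plots the real fitted model the same way.

```r
# simulated stand-in for learning2014 (names and rough scales only)
set.seed(2017)
d <- data.frame(Attitude = runif(166, 1.4, 5))
d$Points <- 11.6 + 3.5 * d$Attitude + rnorm(166, sd = 5.3)

fit <- lm(Points ~ Attitude, data = d)
par(mfrow = c(1, 3))
# which = 1: Residuals vs Fitted, 2: Normal Q-Q, 5: Residuals vs Leverage
plot(fit, which = c(1, 2, 5))
```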
The data for this study was collected from two Portuguese secondary schools using school reports and questionnaires. The aim of this analysis is to find relationships between high alcohol consumption and hypothesized explanatory variables.
Hypotheses:
According to the hypotheses, the selected explanatory variables are sex, going out, family relations and absences. Following the first hypothesis, all analyses have been divided into two groups, females and males.
The dataset contains many background variables such as demographics, social relationships, parents’ education and employment, and living area, as well as grades, absences and health.
A full description of the study is available here: https://archive.ics.uci.edu/ml/datasets/Student+Performance.
Observations about selected explanatory variables:
For this alcohol consumption study, the original data was modified by averaging the weekend and weekday alcohol consumption variables. As a result, two new variables were created: alc_use (Likert scale 1-5, where 1 = very low and 5 = very high) and high_use (logical, TRUE when alc_use > 2).
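A sketch of how alc_use and high_use are derived, using a few toy rows in place of the joined data (in the UCI data, `Dalc` is workday and `Walc` weekend consumption):

```r
# toy rows standing in for the joined UCI student data
alc <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 4, 5))

alc$alc_use  <- (alc$Dalc + alc$Walc) / 2  # average of weekday and weekend use
alc$high_use <- alc$alc_use > 2            # TRUE when the average exceeds 2
alc
```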
After these additions the total number of variables is 35, and the number of observations is 382. All variables and their distributions can be seen below.
As the table below shows, males generally consume more alcohol. In the low-consumption group (very low or low = FALSE), 58% were females and 42% males, whereas in the high-consumption group only 37% were females and 63% males.
FALSE TRUE Sum
F 156 42 198
M 112 72 184
Sum 268 114 382
Spending more free time out with friends significantly increases alcohol consumption in both sexes.
Although absences increase with higher alcohol consumption, the difference is fairly moderate, though larger in males.
The same tendency occurs here as well, but the relationship between alcohol consumption and family relations may run in both directions, and which is cause and which is consequence is unclear. Bad family relationships may lead to higher alcohol consumption, but drinking may also provoke conflicts between family members.
In binomial logistic regression, the selected explanatory variables (discrete or continuous) explain the binary dependent variable, in this case high alcohol use (high_use). It was constructed by classifying the aforementioned alc_use variable: values 1 and 2 were classified as low (FALSE) and values 3-5 as high (TRUE).
All chosen variables (sex, goout, famrel and absences) are statistically significant. Sex is the most significant of all, and its odds ratio is the highest (2.75). This means that the probability of high alcohol consumption is almost three times higher for males than for females. Another strongly influential variable is going out with friends (goout). None of the confidence intervals of the included variables contains 1, which means they all have a meaningful influence in the model.
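The model and the odds ratios can be computed roughly as follows. The data here are simulated stand-ins with the same variable names and coding as `alc` (so the sketch is self-contained), and `confint.default()` gives Wald intervals rather than the profiled intervals shown in the output.

```r
set.seed(2017)
# simulated stand-in for the alc data frame (names and coding as in the text)
n <- 382
alc <- data.frame(
  famrel   = sample(1:5, n, replace = TRUE),
  absences = rpois(n, 4),
  sex      = factor(sample(c("F", "M"), n, replace = TRUE)),
  goout    = sample(1:5, n, replace = TRUE)
)
eta <- -2.8 - 0.39 * alc$famrel + 0.08 * alc$absences +
  1.01 * (alc$sex == "M") + 0.77 * alc$goout
alc$high_use <- rbinom(n, 1, plogis(eta)) == 1

m <- glm(high_use ~ famrel + absences + sex + goout,
         data = alc, family = "binomial")
exp(cbind(OR = coef(m), confint.default(m)))  # odds ratios with Wald 95% CIs
```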
Call:
glm(formula = high_use ~ famrel + absences + sex + goout, family = "binomial",
data = alc)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7151 -0.7820 -0.5137 0.7537 2.5463
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.76826 0.66170 -4.184 2.87e-05 ***
famrel -0.39378 0.14035 -2.806 0.005020 **
absences 0.08168 0.02200 3.713 0.000205 ***
sexM 1.01234 0.25895 3.909 9.25e-05 ***
goout 0.76761 0.12316 6.232 4.59e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 465.68 on 381 degrees of freedom
Residual deviance: 379.81 on 377 degrees of freedom
AIC: 389.81
Number of Fisher Scoring iterations: 4
OR 2.5 % 97.5 %
(Intercept) 0.06277092 0.01648406 0.2225470
famrel 0.67449969 0.51060362 0.8869199
absences 1.08510512 1.04012840 1.1353940
sexM 2.75203774 1.66768032 4.6128191
goout 2.15461275 1.70345866 2.7639914
With the chosen variables, the prediction error of the model (penalty) is ~0.20 (the TRUE-FALSE and FALSE-TRUE proportions summed). This means that about 4/5 of the predictions are correct (the TRUE-TRUE and FALSE-FALSE proportions summed).
prediction
high_use FALSE TRUE Sum
FALSE 0.66492147 0.03664921 0.70157068
TRUE 0.16753927 0.13089005 0.29842932
Sum 0.83246073 0.16753927 1.00000000
The mean prediction error of the model (loss function) is determined by calculating the proportion of incorrect predictions:
[1] 0.2041885
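A loss function of this kind can be sketched as below: a predicted probability is counted as wrong when it falls on the other side of the 0.5 threshold from the true class.

```r
# proportion of predictions on the wrong side of the 0.5 threshold
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# two of four toy predictions are wrong -> 0.5
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.4, 0.6))
```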
To test the performance of the chosen model, a method called cross-validation can be used. In cross-validation, the sample data is divided into training and test sets. For example, if the data is divided into 5 equal parts and 4/5 is used as training data and 1/5 as test data, this is called 5-fold cross-validation. In K-fold cross-validation the data is divided into K parts; here K = the number of observations, so the training data includes K-1 observations and the test data 1 observation (leave-one-out). The process is rotated so that every part of the data serves in turn as either training or test data.
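The procedure can be sketched with `boot::cv.glm()`. For a self-contained example, the built-in `mtcars` data (binary outcome `am`) is used here in place of `alc`; the course model plugs in `alc`, `high_use` and its fitted glm the same way, with `K = nrow(alc)` for leave-one-out.

```r
library(boot)

set.seed(2017)
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# stand-in model on built-in data; the analysis uses high_use ~ ... on alc
m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

# K = 10 gives 10-fold CV; K = nrow(data) gives leave-one-out
cv10 <- cv.glm(data = mtcars, cost = loss_func, glmfit = m, K = 10)
cv10$delta[1]  # cross-validated prediction error
```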
In this case, this K-fold cross-validation gives a slightly higher prediction error (0.21) than the mean prediction error above.
[1] 0.2146597
This model has a smaller prediction error (0.21) under 10-fold cross-validation than the model introduced in DataCamp (which had an error of about 0.26). This means that the variables chosen for my model predict high alcohol consumption better.
[1] 0.2146597
My original model, introduced earlier, has the largest number of predictors. The other models are either reduced combinations of the original model or models with previously unused variables added (N = 18). In 10-fold cross-validation, the smallest penalty loss (0.21) was obtained with my original model. Changing the variables increased or decreased the penalty loss depending on the significance of the chosen variables.
OR 2.5 % 97.5 %
(Intercept) 0.06277092 0.01648406 0.2225470
famrel 0.67449969 0.51060362 0.8869199
absences 1.08510512 1.04012840 1.1353940
sexM 2.75203774 1.66768032 4.6128191
goout 2.15461275 1.70345866 2.7639914
[1] 0.2198953
OR 2.5 % 97.5 %
(Intercept) 0.4877368 0.1702903 1.3635949
famrel 0.7475910 0.5811204 0.9595347
absences 1.0980228 1.0524564 1.1504519
sexM 2.7976292 1.7476499 4.5467291
[1] 0.2643979
OR 2.5 % 97.5 %
(Intercept) 0.7187304 0.2621851 1.935500
famrel 0.7857743 0.6158556 1.001830
absences 1.0905430 1.0452818 1.142647
[1] 0.2879581
OR 2.5 % 97.5 %
(Intercept) 1.1938851 0.4610715 3.0668110
famrel 0.7673146 0.6041200 0.9734032
[1] 0.3010471
OR 2.5 % 97.5 %
(Intercept) 0.01555815 0.005885392 0.03804621
absences 1.08782902 1.042458467 1.13933894
sexM 2.60835292 1.593132148 4.33151387
goout 2.07468346 1.650182481 2.64111050
[1] 0.2198953
OR 2.5 % 97.5 %
(Intercept) 0.0917904 0.02572428 0.3085338
famrel 0.7047893 0.53909482 0.9181526
absences 1.0772983 1.03352289 1.1268709
goout 2.1613089 1.71931650 2.7529251
[1] 0.2486911
OR 2.5 % 97.5 %
(Intercept) 0.4877368 0.1702903 1.3635949
famrel 0.7475910 0.5811204 0.9595347
absences 1.0980228 1.0524564 1.1504519
sexM 2.7976292 1.7476499 4.5467291
[1] 0.2643979
OR 2.5 % 97.5 %
(Intercept) 0.2775097 0.2012914 0.3760343
absences 1.0939748 1.0479459 1.1468072
[1] 0.2853403
OR 2.5 % 97.5 %
(Intercept) 0.01555815 0.005885392 0.03804621
absences 1.08782902 1.042458467 1.13933894
sexM 2.60835292 1.593132148 4.33151387
goout 2.07468346 1.650182481 2.64111050
[1] 0.2146597
OR 2.5 % 97.5 %
(Intercept) 0.09489664 0.02602013 0.3241397
famrel 0.66248876 0.50299888 0.8679153
sexM 2.54505957 1.56349803 4.1969640
goout 2.21913298 1.76086023 2.8372779
[1] 0.2356021
OR 2.5 % 97.5 %
(Intercept) 0.4877368 0.1702903 1.3635949
famrel 0.7475910 0.5811204 0.9595347
absences 1.0980228 1.0524564 1.1504519
sexM 2.7976292 1.7476499 4.5467291
[1] 0.2539267
OR 2.5 % 97.5 %
(Intercept) 0.2692308 0.1891905 0.3746607
sexM 2.3877551 1.5268584 3.7716682
[1] 0.2984293
OR 2.5 % 97.5 %
(Intercept) 0.8774021 0.3274356 2.3203334
famrel 0.7331474 0.5730317 0.9356179
sexM 2.5191408 1.6005358 4.0108387
[1] 0.2905759
OR 2.5 % 97.5 %
(Intercept) 0.159445 0.1012577 0.2427684
absences 1.101409 1.0549317 1.1548057
sexM 2.658116 1.6710354 4.2863129
[1] 0.2539267
OR 2.5 % 97.5 %
(Intercept) 0.02208168 0.008754964 0.05186666
sexM 2.39559680 1.484457552 3.91015147
goout 2.14025612 1.708110942 2.71701439
[1] 0.2277487
OR 2.5 % 97.5 %
(Intercept) 0.1314829 0.03816207 0.4283774
famrel 0.6930400 0.53161279 0.8999454
goout 2.2081701 1.76229526 2.8045361
[1] 0.2617801
OR 2.5 % 97.5 %
(Intercept) 0.026083 0.01076768 0.05922061
absences 1.080069 1.03581041 1.13071158
goout 2.087359 1.66817170 2.64318173
[1] 0.2356021
OR 2.5 % 97.5 %
(Intercept) 0.03504298 0.01505792 0.07687797
goout 2.13743726 1.71351154 2.69965289
[1] 0.2696335
OR 2.5 % 97.5 %
(Intercept) 0.006582546 8.420856e-05 0.4625185
famrel 0.642908900 4.804204e-01 0.8557823
absences 1.084677514 1.038897e+00 1.1349851
sexM 2.362741264 1.404117e+00 4.0253561
G3 0.978352444 8.988314e-01 1.0653837
freetime 1.143635232 8.645141e-01 1.5163385
age 1.129138861 8.986345e-01 1.4213128
failures 1.194170133 7.676563e-01 1.8609556
famsizeLE3 1.416227556 8.057840e-01 2.4756274
romanticyes 0.627696639 3.507903e-01 1.1001635
health 1.145482751 9.468044e-01 1.3925122
goout 2.042410603 1.593579e+00 2.6564950
[1] 0.2277487
The Boston dataset ships with the R package MASS as a demonstration/example dataset. It contains social, environmental and economic information about the Greater Boston area.
The dataset has 14 variables and 506 observations, and all variables are numerical:
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
[1] 506 14
As the pairs plot shows, most of the variables are not normally distributed: most are skewed and some are bimodal. Correlations between variables are easier to read from the correlation plot, where the largest circles in the upper-right half indicate the strongest correlations (blue = positive, red = negative). The corresponding numeric values are mirrored in the lower-left half.
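The numeric side of that figure is simply the correlation matrix; the circle version can then be drawn with `corrplot::corrplot()` (assuming the corrplot package is installed).

```r
library(MASS)  # provides the Boston dataset

cor_matrix <- cor(Boston)
round(cor_matrix, 2)
# corrplot::corrplot(cor_matrix, method = "circle") draws the circle plot
```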
crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
After standardization, the mean of every variable is zero; that is, the variables are distributed around zero. This can be seen in the summary table (compare with the original summary above).
The crime rate variable has been converted into a categorical variable with 4 levels: low, med_low, med_high and high. Each class contains one quartile (25%) of the data.
Train and test sets have been created by randomly dividing the original (standardized) data into two groups: 80% of the observations belong to the train set and 20% to the test set.
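The standardization, categorization and split described above can be sketched as follows (variable and level names as in the text; the exact split depends on the random seed):

```r
library(MASS)  # provides the Boston dataset

set.seed(2017)
boston_scaled <- as.data.frame(scale(Boston))  # center and scale all columns

bins <- quantile(boston_scaled$crim)           # quartile break points
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL                     # drop the continuous version

ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]                  # 80% of rows
test  <- boston_scaled[-ind, ]                 # remaining 20%
```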
crim zn indus
Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
Median :-0.390280 Median :-0.48724 Median :-0.2109
Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
chas nox rm age
Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
dis rad tax ptratio
Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
black lstat medv
Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
Median : 0.3808 Median :-0.1811 Median :-0.1449
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
In the linear discriminant analysis (LDA), only the train set (80% of the data) is analysed. The target variable is the new categorical crime-rate variable (low, med_low, med_high, high), and all other variables of the dataset are used as predictors (see the overview of the data).
The biplot below shows that the variable “rad” (index of accessibility to radial highways) has an extremely strong influence on LD1 and LD2 compared to the other variables. In the biplot, the horizontal component of each arrow describes its contribution to the LD1 dimension (x-axis) and the vertical component its contribution to the LD2 dimension (y-axis). The sign of the linear discriminant coefficient determines the direction of the arrow, and the longer the arrow, the bigger the influence. Most of the arrows contribute to both LD1 and LD2, so their directions lie at various angles between the two axes. For example, “rad”, the most important variable for LD1 in the LDA table below, has coefficients LD1 = 3.16 and LD2 = 0.81, which are directly readable as the coordinates of the arrowhead. Similarly, “nox” has its arrowhead at (0.32, -0.62). According to the proportion of trace, LD1 alone explains 94% of the between-group variance, LD2 explains 5% and LD3 only 1%.
Call:
lda(crime ~ ., data = train)
Prior probabilities of groups:
low med_low med_high high
0.2574257 0.2400990 0.2549505 0.2475248
Group means:
zn indus chas nox rm
low 1.07571389 -0.9427007 -0.158758869 -0.8964515 0.43243574
med_low -0.09803072 -0.2872737 -0.069385757 -0.5743277 -0.14520873
med_high -0.41147670 0.2142348 0.262810770 0.4186210 0.09221636
high -0.48724019 1.0171519 0.003267949 1.0513236 -0.40045192
age dis rad tax ptratio
low -0.9425473 0.9390399 -0.6837115 -0.7226589 -0.43425988
med_low -0.3748720 0.3361986 -0.5414282 -0.4601149 -0.02612746
med_high 0.4298869 -0.3898599 -0.4054077 -0.3010732 -0.27633584
high 0.8267865 -0.8608197 1.6377820 1.5138081 0.78037363
black lstat medv
low 0.37470204 -0.78062702 0.51791819
med_low 0.31170089 -0.13354902 -0.01343118
med_high 0.05418107 0.02148723 0.16396143
high -0.82054595 0.88571380 -0.65670740
Coefficients of linear discriminants:
LD1 LD2 LD3
zn 0.08351233 0.737792205 -0.890241931
indus 0.03639139 -0.307827188 0.348037260
chas -0.06967471 -0.068413547 0.044695253
nox 0.31515640 -0.620523914 -1.384900490
rm -0.12656154 -0.139690019 -0.128898300
age 0.27685730 -0.293502228 -0.326403775
dis -0.05676486 -0.174510049 -0.009029386
rad 3.15689728 0.807839552 -0.057102082
tax 0.01006334 0.190689070 0.500024658
ptratio 0.09567495 0.009425240 -0.234358605
black -0.11691521 0.000273517 0.092761854
lstat 0.21755559 -0.232685538 0.359625464
medv 0.17643770 -0.347199630 -0.207212456
Proportion of trace:
LD1 LD2 LD3
0.9393 0.0466 0.0141
In the test dataset, the categorical crime variable has been removed. The table below cross-tabulates the true classes from the original test data against the classes predicted for the test data (crime removed). The total number of observations is 102 (506/5, rounded up). The diagonal (from the top-left corner) contains the correct predictions (sum = 73) and the off-diagonal cells the incorrect ones (sum = 29), so the prediction error is 29/102 ≈ 0.28.
predicted
correct low med_low med_high high Sum
low 8 12 3 0 23
med_low 3 21 5 0 29
med_high 0 5 17 1 23
high 0 0 0 27 27
Sum 11 38 25 28 102
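The prediction step can be sketched as below. The setup is repeated so that the block is self-contained; the table and error differ from those above because the train/test split is random.

```r
library(MASS)  # provides the Boston dataset and lda()

set.seed(2017)
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

lda_fit <- lda(crime ~ ., data = train)  # fit on the train set
correct <- test$crime                    # save the true classes ...
test$crime <- NULL                       # ... and remove them from the test set
pred <- predict(lda_fit, newdata = test)$class

tab <- table(correct = correct, predicted = pred)
tab
1 - sum(diag(tab)) / sum(tab)            # prediction error
```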
For this model, a Euclidean distance matrix was calculated; a summary of the distances is shown below. The optimal number of clusters can be investigated with the K-means algorithm: where the total within-cluster sum of squares (TWSS) drops sharply, that indicates the optimal number of clusters. In this case the optimal number of clusters is 2 or 3. In the first plot the data is classified into two clusters and in the second into three.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
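The distance matrix and the TWSS search can be sketched as follows (the TWSS values vary slightly with the random seed):

```r
library(MASS)  # provides the Boston dataset

set.seed(2017)
boston_scaled <- as.data.frame(scale(Boston))

dist_eu <- dist(boston_scaled)          # Euclidean distance matrix
summary(as.vector(dist_eu))

# TWSS for 1-10 clusters; look for the sharp drop
twss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:10, twss, type = "b")

km <- kmeans(boston_scaled, centers = 2)  # final clustering with 2 clusters
```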
Here LDA is computed with the clusters as target classes, with all other variables in the Boston data as predictors. The LDA tables and biplots show how the results depend on the number of clusters: with different numbers of clusters, different variables become meaningful, as seen in the plots.
Call:
lda(clu3 ~ ., data = boston_scaled_new)
Prior probabilities of groups:
1 2 3
0.3083004 0.4130435 0.2786561
Group means:
crim zn indus nox rm age
1 0.8647905 -0.4872402 1.0949412 1.1524404 -0.4717159 0.78718751
2 -0.3763256 -0.3947158 -0.1595373 -0.2992872 -0.2772450 0.02424663
3 -0.3989734 1.1241495 -0.9749470 -0.8314162 0.9328502 -0.90687091
dis rad tax ptratio black lstat
1 -0.8532750 1.3172714 1.3530841 0.57186480 -0.6936034 0.88830984
2 0.0304962 -0.5856776 -0.5482746 0.09789185 0.2769803 -0.03824517
3 0.8988453 -0.5892747 -0.6843385 -0.77780359 0.3568316 -0.92612124
medv
1 -0.6893319
2 -0.1525634
3 0.9888052
Coefficients of linear discriminants:
LD1 LD2
crim 0.03403154 0.1907082
zn 0.04097709 0.8724398
indus -0.41998138 -0.1847419
nox -1.06730082 0.6654667
rm 0.26776386 0.5104809
age 0.21123453 -0.4670361
dis -0.07658281 0.3338914
rad -1.17679746 0.3612912
tax -0.96644205 0.5042929
ptratio -0.06864005 -0.1601563
black 0.07085932 -0.0359700
lstat -0.26379244 0.3189355
medv 0.01076674 0.6395889
Proportion of trace:
LD1 LD2
0.8729 0.1271
Call:
lda(clu4 ~ ., data = boston_scaled_new)
Prior probabilities of groups:
1 2 3 4
0.4011858 0.1660079 0.3221344 0.1106719
Group means:
crim zn indus nox rm age
1 -0.3793592 -0.3541732 -0.2369409 -0.3788647 -0.3076659 -0.09191194
2 -0.4124621 1.9031602 -1.0764395 -1.1428600 0.6095020 -1.39546865
3 0.8082769 -0.4872402 1.1165562 1.1413403 -0.4676591 0.79696608
4 -0.3587926 -0.1526454 -0.7764063 -0.2344411 1.5622579 0.10664320
dis rad tax ptratio black lstat
1 0.1379588 -0.5920713 -0.5838167 0.08129234 0.2784401 -0.06159606
2 1.5205894 -0.6250261 -0.5943244 -0.67561813 0.3537454 -0.90687350
3 -0.8539968 1.2199444 1.2927317 0.58616084 -0.6486732 0.87910380
4 -0.2952439 -0.4671119 -0.7549503 -0.98740427 0.3481389 -0.97522401
medv
1 -0.1690186
2 0.6658984
3 -0.7034406
4 1.6613592
Coefficients of linear discriminants:
LD1 LD2 LD3
crim -0.03692074 -0.063843487 0.17437690
zn -0.10469860 -1.680125461 -0.02034829
indus 0.62305615 -0.375784264 -0.51749840
nox 1.07818726 -0.501408664 0.49444666
rm -0.13015732 -0.062534323 0.64219861
age -0.18778510 0.593603356 0.12209172
dis 0.01495296 -0.528845086 -0.16022047
rad 0.71521529 0.091708443 0.26711660
tax 0.86706461 -0.832332218 0.24904935
ptratio 0.21355380 -0.110643988 -0.18716017
black -0.01780187 0.008088268 -0.03064577
lstat 0.23409949 -0.104352523 0.29032605
medv -0.14661672 0.063424182 0.97435297
Proportion of trace:
LD1 LD2 LD3
0.7143 0.2084 0.0773
Call:
lda(clu5 ~ ., data = boston_scaled_new)
Prior probabilities of groups:
1 2 3 4 5
0.10671937 0.06916996 0.39328063 0.23913043 0.19169960
Group means:
crim zn indus nox rm age
1 -0.2753323 -0.4872402 1.5337294 1.1273809 -0.6003284 0.9334996
2 1.4802645 -0.4872402 1.0149946 0.9676887 -0.2969389 0.7656016
3 -0.3884901 -0.3308141 -0.4873088 -0.4761310 -0.2318056 -0.1989968
4 -0.3981339 1.2930469 -0.9902994 -0.8283387 1.0566896 -0.9121115
5 0.9128084 -0.4872402 1.0149946 1.0333132 -0.4012324 0.7501117
dis rad tax ptratio black lstat
1 -0.8995039 -0.6096828 0.01485481 -0.3917541 -0.1262348 0.6474932
2 -0.8580043 1.6596029 1.52941294 0.8057784 -3.2970564 1.1699052
3 0.2356027 -0.5732709 -0.60914944 0.1061891 0.3164212 -0.1478389
4 0.9121798 -0.5955687 -0.67325561 -0.8788401 0.3554817 -0.9544043
5 -0.8108798 1.6596029 1.52941294 0.8057784 0.1673459 0.7112530
medv
1 -0.4618433
2 -1.0473739
3 -0.1012054
4 1.0995470
5 -0.5289453
Coefficients of linear discriminants:
LD1 LD2 LD3 LD4
crim 0.11743258 0.04770450 -0.14300153 -0.16024852
zn 0.34092540 0.12344732 -0.08107374 -1.23725268
indus 0.53753143 -1.85979858 0.92454526 -1.10944073
nox -0.00906438 -0.29233935 0.67637180 -0.69383481
rm -0.05331958 0.27825890 -0.27651832 -0.52558718
age -0.02982587 -0.24810645 0.11616057 0.17202368
dis -0.18913313 -0.03732336 0.16315838 -0.26855975
rad 5.88697673 2.04722673 0.02514581 0.50932714
tax 0.13814687 0.14410388 0.09950265 -0.25979580
ptratio 0.20734779 0.06893525 -0.07716785 0.22476182
black -0.33376654 1.18823488 2.14340012 -0.03614042
lstat -0.01997449 0.13151481 -0.19758426 -0.25835224
medv -0.11300541 0.24641561 -0.16457165 -0.62281824
Proportion of trace:
LD1 LD2 LD3 LD4
0.8192 0.0840 0.0681 0.0287
Call:
lda(clu6 ~ ., data = boston_scaled_new)
Prior probabilities of groups:
1 2 3 4 5 6
0.30237154 0.12845850 0.21146245 0.09683794 0.05731225 0.20355731
Group means:
crim zn indus nox rm age
1 -0.3975627 -0.1637000 -0.5873353 -0.6615915 -0.1682976 -0.6574663
2 -0.4140702 2.2813035 -1.1562550 -1.1768217 0.7293996 -1.4086475
3 -0.3231486 -0.4822312 0.6308150 0.5041272 -0.5223445 0.7753457
4 -0.3680230 -0.1494734 -0.7440331 -0.2107180 1.7049351 0.1966439
5 3.0022987 -0.4872402 1.0149946 1.0593345 -1.3064650 0.9805356
6 0.5173303 -0.4872402 1.0149946 1.0036872 -0.1109215 0.6904986
dis rad tax ptratio black lstat
1 0.5448232 -0.5720261 -0.6911565 -0.06248294 0.36026575 -0.3806108
2 1.5645584 -0.6656010 -0.5702843 -0.80946918 0.35416061 -0.9741045
3 -0.5711540 -0.5943977 -0.1915536 0.08745080 0.03673407 0.5492158
4 -0.3113322 -0.5037339 -0.7871603 -1.09274698 0.34883211 -0.9623279
5 -1.0484716 1.6596029 1.5294129 0.80577843 -1.19066142 1.8708759
6 -0.7599982 1.6596029 1.5294129 0.80577843 -0.62752658 0.5406100
medv
1 0.01419929
2 0.81491803
3 -0.48624639
4 1.73167306
5 -1.31020021
6 -0.48514538
Coefficients of linear discriminants:
LD1 LD2 LD3 LD4 LD5
crim 0.26905214 -0.04038901 0.952783084 1.05716987 0.83157632
zn 0.02554270 -1.56332967 0.861975978 -1.07176445 0.45885648
indus 0.26308114 0.34083276 0.299613031 -0.64713459 -0.27149224
nox -0.03462213 0.07174078 -0.011080924 -0.44577196 0.17727824
rm -0.09400180 -0.18081151 -0.748312466 -0.28307867 0.35366014
age 0.09743790 0.59122392 -0.124953000 -0.52102866 0.63971252
dis -0.28508543 -0.26165892 -0.004961456 0.09291479 -0.31814880
rad 5.99463004 -1.10320047 -1.528777174 0.82447556 -1.05043782
tax 0.09961052 -0.43262170 0.468017911 -0.81595269 0.38580292
ptratio 0.25749486 0.08897531 0.282642121 -0.34518855 -0.02514549
black -0.04609198 -0.01324538 -0.083403025 0.01031609 -0.10644781
lstat 0.09758566 0.09732902 0.327335484 0.20747689 0.64677037
medv -0.06198719 -0.18784402 -0.206662871 0.28964586 0.98112363
Proportion of trace:
LD1 LD2 LD3 LD4 LD5
0.8529 0.0830 0.0298 0.0185 0.0159
The data points are of course in the same positions in both plots; only the colouring differs. The grouping differs slightly in the main group depending on whether the colours are coded by crime class or by cluster. When coloured by crime class, the high-crime observations form one well-isolated separate group, whereas with cluster colouring there are two such groups.
The data for this analysis comes from the United Nations Development Programme. Two datasets have been combined, the Human Development Index (HDI) and the Gender Inequality Index (GII). As a result, the new dataset includes the following variables:
Most of the variables are skewed to some extent. The correlation matrix shows high positive and negative correlations throughout; in fact, most of the variables in this dataset correlate strongly. Only F_Rep_Parl and Work_Rto_FM had weak correlations with all the other variables (except with each other).
Life_Exp Edu_Exp GNI Mat_Mortal
Min. :49.00 Min. : 5.40 Min. : 581 Min. : 1.0
1st Qu.:66.30 1st Qu.:11.25 1st Qu.: 4198 1st Qu.: 11.5
Median :74.20 Median :13.50 Median : 12040 Median : 49.0
Mean :71.65 Mean :13.18 Mean : 17628 Mean : 149.1
3rd Qu.:77.25 3rd Qu.:15.20 3rd Qu.: 24512 3rd Qu.: 190.0
Max. :83.50 Max. :20.20 Max. :123124 Max. :1100.0
Teen_Preg F_Rep_Parl Edu_Rto_FM Work_Rto_FM
Min. : 0.60 Min. : 0.00 Min. :0.1717 Min. :0.1857
1st Qu.: 12.65 1st Qu.:12.40 1st Qu.:0.7264 1st Qu.:0.5984
Median : 33.60 Median :19.30 Median :0.9375 Median :0.7535
Mean : 47.16 Mean :20.91 Mean :0.8529 Mean :0.7074
3rd Qu.: 71.95 3rd Qu.:27.95 3rd Qu.:0.9968 3rd Qu.:0.8535
Max. :204.80 Max. :57.50 Max. :1.4967 Max. :1.0380
In principal component analysis (PCA) the original multidimensional data is transformed into a new space, and in most cases the number of dimensions is reduced. The new dimensions, called principal components, are uncorrelated with each other. The first principal component explains the maximum amount of variance in the original data. The second component is orthogonal to the first and explains the maximum amount of the remaining variance. This principle continues with the subsequent components, each of which is less important than the previous one. A component whose eigenvalue is smaller than one can usually be left out. In biplots, PC1 is usually plotted on the x-axis and PC2 on the y-axis. Arrows and labels show the connections between the original features and the principal components (usually PC1 and PC2).
In a biplot, the angles between the arrows correspond to the correlations of the original features: the smaller the angle, the stronger the positive correlation. Two arrows that are parallel but point in opposite directions indicate a strong negative correlation, and a 90-degree angle means zero correlation (orthogonal, like the x-axis (PC1) and y-axis (PC2)). The length of an arrow reflects the standard deviation of the original feature.
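This kind of analysis can be sketched in R with prcomp and a biplot. The variable names below are the ones used in this analysis, but the data is a random stand-in, not the real UN data:

```r
# A minimal sketch of PCA with prcomp, using random stand-in data with the
# variable names from this analysis (not the real UN data).
set.seed(1)
human <- as.data.frame(matrix(rnorm(8 * 40), ncol = 8))
names(human) <- c("Life_Exp", "Edu_Exp", "GNI", "Mat_Mortal",
                  "Teen_Preg", "F_Rep_Parl", "Edu_Rto_FM", "Work_Rto_FM")

pca <- prcomp(human)          # PCA on the (unstandardized) data
s   <- summary(pca)           # importance of components
pct <- round(100 * s$importance["Proportion of Variance", ], 1)

biplot(pca, choices = 1:2)    # PC1 on the x-axis, PC2 on the y-axis
```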
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
PC8
Standard deviation 0.1591
Proportion of Variance 0.0000
Cumulative Proportion 1.0000
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
100 0 0 0 0 0 0 0
Because PCA is sensitive to scaling, an original feature with a large variance gets greater importance than a feature with a smaller variance. Standardizing the original data equalizes the importance of the original features.
In the table and biplot below, the PCA itself is the same as above, but the original data has been standardized before the analysis.
The variable GNI has extreme values compared to all the other variables. If the data is not standardized, this changes the results significantly: as the table above shows, PC1 explains 100% of the variance in the original data, due to GNI alone. When the data is standardized, the values of GNI are on a scale comparable to the other variables and more of them contribute. Now PC1 explains 53% of the variance and PC2 16%.
It is worth mentioning that the PCA function in FactoMineR, unlike prcomp (used in this analysis), standardizes the data by default, so it reports similar importance for the PCs whether or not the data has been standardized beforehand.
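The effect of scaling can be demonstrated with a small synthetic example (hypothetical data, not the UN dataset): one column with a huge variance dominates prcomp unless the data is standardized first.

```r
# Hypothetical data: two unit-variance columns and one 'GNI-like' column
# whose standard deviation is a thousand times larger.
set.seed(42)
d <- data.frame(a = rnorm(100), b = rnorm(100), gni = rnorm(100, sd = 1000))

p_raw <- prcomp(d)                  # unscaled: PC1 is essentially just 'gni'
p_std <- prcomp(d, scale. = TRUE)   # same as prcomp(scale(d))

prop_raw <- summary(p_raw)$importance["Proportion of Variance", "PC1"]
prop_std <- summary(p_std)$importance["Proportion of Variance", "PC1"]
# prop_raw is practically 1; prop_std is roughly 1/3 per component
```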
Life_Exp Edu_Exp GNI Mat_Mortal
Min. :-2.7188 Min. :-2.7378 Min. :-0.9193 Min. :-0.6992
1st Qu.:-0.6425 1st Qu.:-0.6782 1st Qu.:-0.7243 1st Qu.:-0.6496
Median : 0.3056 Median : 0.1140 Median :-0.3013 Median :-0.4726
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6717 3rd Qu.: 0.7126 3rd Qu.: 0.3712 3rd Qu.: 0.1932
Max. : 1.4218 Max. : 2.4730 Max. : 5.6890 Max. : 4.4899
Teen_Preg F_Rep_Parl Edu_Rto_FM Work_Rto_FM
Min. :-1.1325 Min. :-1.8203 Min. :-2.8189 Min. :-2.6247
1st Qu.:-0.8394 1st Qu.:-0.7409 1st Qu.:-0.5233 1st Qu.:-0.5484
Median :-0.3298 Median :-0.1403 Median : 0.3503 Median : 0.2316
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6030 3rd Qu.: 0.6127 3rd Qu.: 0.5958 3rd Qu.: 0.7350
Max. : 3.8344 Max. : 3.1850 Max. : 2.6646 Max. : 1.6632
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
PC7 PC8
Standard deviation 0.45900 0.32224
Proportion of Variance 0.02634 0.01298
Cumulative Proportion 0.98702 1.00000
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
PC1 groups most of the highly correlated variables together and could be interpreted as the "goodness" of life: higher Gross National Income, longer life expectancy, longer expected education, and a female-to-male education ratio closer to 1 can all be expected to make life and society better. Within the same component, maternal mortality and teen pregnancy correlate negatively and make life worse. PC2 could be interpreted as gender equality in a society: a higher percentage of female members of parliament usually goes together with a female-to-male labour-force ratio closer to 1, meaning that women are more equal members of the society.
The tea dataset is included in the FactoMineR R package and comes from a questionnaire on tea: 300 participants were asked how they drink tea, how they perceive the product, and some personal details. There are 36 variables in total, of which 6 are included in this dataset.
The categorical variables (factors) are:
In MCA plots the categories are grouped by their correlations with each other and with the components. For example, the biplot below shows that unpackaged tea is usually drunk in tea shops, whereas tea bags are mostly bought in chain stores. In fact, in both dimensions the variables "how" (tea bag, tea bag + unpackaged, unpackaged) and "where" (chain store, chain store + tea shop, tea shop) were the most significant.
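The MCA behind the output below can be reproduced with FactoMineR (assuming the package is installed); the six column names are the ones used in this analysis:

```r
library(FactoMineR)

data(tea)  # the tea questionnaire data shipped with FactoMineR
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]

mca <- MCA(tea_time, graph = FALSE)   # multiple correspondence analysis
summary(mca)                          # eigenvalues, individuals, categories
plot(mca, invisible = "ind")          # biplot of the categories only
```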
Call:
MCA(X = tea_time, graph = FALSE)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
Variance 0.279 0.261 0.219 0.189 0.177 0.156
% of var. 15.238 14.232 11.964 10.333 9.667 8.519
Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
Variance 0.144 0.141 0.117 0.087 0.062
% of var. 7.841 7.705 6.392 4.724 3.385
Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
Individuals (the 10 first)
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
ctr cos2
1 0.163 0.104 |
2 0.735 0.314 |
3 0.062 0.069 |
4 0.068 0.073 |
5 0.062 0.069 |
6 0.062 0.069 |
7 0.062 0.069 |
8 0.735 0.314 |
9 0.007 0.003 |
10 0.643 0.261 |
Categories (the 10 first)
Dim.1 ctr cos2 v.test Dim.2 ctr
black | 0.473 3.288 0.073 4.677 | 0.094 0.139
Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
green | 0.486 1.547 0.029 2.952 | -0.933 6.111
alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
other | 0.288 0.148 0.003 0.876 | 1.820 6.347
tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
cos2 v.test Dim.3 ctr cos2 v.test
black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
Categorical variables (eta2)
Dim.1 Dim.2 Dim.3
Tea | 0.126 0.108 0.410 |
How | 0.076 0.190 0.394 |
how | 0.708 0.522 0.010 |
sugar | 0.065 0.001 0.336 |
where | 0.702 0.681 0.055 |
lunch | 0.000 0.064 0.111 |
If the same person makes several observations in the same experiment, or the weight of the same animal is measured several times over a longer period or under different conditions, those measurements are hardly independent. This must be taken into account in the models used to analyse such data.
Two such datasets are wrangled and analysed in this exercise. The first, RATS, contains information on different diets and their influence on the weight of rats over a nine-week period. In the second, BPRS, 40 male subjects were randomly assigned to two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) before treatment and then at weekly intervals for eight weeks. The scale is used to evaluate patients suspected of having schizophrenia.
Both datasets have been converted to long form for the analyses. The data wrangling code can be explored here.
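The wide-to-long step can be sketched in base R with reshape (the report itself likely used tidyr, but the idea is the same); the data here is a two-rat stand-in with the same column pattern as RATS:

```r
# Stand-in wide data shaped like RATS: one row per rat, one WD<day> column
# per weighing day (only two days here to keep the example small).
RATS <- data.frame(ID = factor(1:2), Group = factor(c(1, 2)),
                   WD1 = c(240, 410), WD8 = c(250, 415))

# One row per measurement: Weight plus a numeric Time taken from "WD<day>".
RATSL <- reshape(RATS, direction = "long",
                 varying = c("WD1", "WD8"), v.names = "Weight",
                 timevar = "Time", times = c(1, 8), idvar = "ID")
```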
Observations: 16
Variables: 13
$ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
$ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
$ WD1 <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 5...
$ WD8 <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 5...
$ WD15 <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 5...
$ WD22 <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 5...
$ WD29 <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 5...
$ WD36 <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 5...
$ WD43 <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 5...
$ WD44 <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 5...
$ WD50 <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 6...
$ WD57 <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 6...
$ WD64 <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 6...
Observations: 176
Variables: 5
$ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
$ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1...
$ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",...
$ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, ...
$ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8...
Call:
lm(formula = RATSL$Weight ~ RATSL$Time + RATSL$Group, data = RATSL)
Residuals:
Min 1Q Median 3Q Max
-60.643 -24.017 0.697 10.837 125.459
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 244.0689 5.7725 42.281 < 2e-16 ***
RATSL$Time 0.5857 0.1331 4.402 1.88e-05 ***
RATSL$Group2 220.9886 6.3402 34.855 < 2e-16 ***
RATSL$Group3 262.0795 6.3402 41.336 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 34.34 on 172 degrees of freedom
Multiple R-squared: 0.9283, Adjusted R-squared: 0.9271
F-statistic: 742.6 on 3 and 172 DF, p-value: < 2.2e-16
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: Weight ~ Time + Group + (1 | ID)
Data: RATSL
AIC BIC logLik deviance df.resid
1333.2 1352.2 -660.6 1321.2 170
Scaled residuals:
Min 1Q Median 3Q Max
-3.5386 -0.5581 -0.0494 0.5693 3.0990
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 1085.92 32.953
Residual 66.44 8.151
Number of obs: 176, groups: ID, 16
Fixed effects:
Estimate Std. Error t value
(Intercept) 244.06890 11.73107 20.80
Time 0.58568 0.03158 18.54
Group2 220.98864 20.23577 10.92
Group3 262.07955 20.23577 12.95
Correlation of Fixed Effects:
(Intr) Time Group2
Time -0.090
Group2 -0.575 0.000
Group3 -0.575 0.000 0.333
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: Weight ~ Time + Group + (Time | ID)
Data: RATSL
AIC BIC logLik deviance df.resid
1194.2 1219.6 -589.1 1178.2 168
Scaled residuals:
Min 1Q Median 3Q Max
-3.2261 -0.4322 0.0555 0.5638 2.8827
Random effects:
Groups Name Variance Std.Dev. Corr
ID (Intercept) 1140.5363 33.7718
Time 0.1122 0.3349 -0.22
Residual 19.7456 4.4436
Number of obs: 176, groups: ID, 16
Fixed effects:
Estimate Std. Error t value
(Intercept) 246.45727 11.81526 20.859
Time 0.58568 0.08548 6.852
Group2 214.58736 20.17983 10.634
Group3 258.92732 20.17983 12.831
Correlation of Fixed Effects:
(Intr) Time Group2
Time -0.166
Group2 -0.569 0.000
Group3 -0.569 0.000 0.333
Data: RATSL
Models:
RATS_ref: Weight ~ Time + Group + (1 | ID)
RATS_ref1: Weight ~ Time + Group + (Time | ID)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
RATS_ref 6 1333.2 1352.2 -660.58 1321.2
RATS_ref1 8 1194.2 1219.6 -589.11 1178.2 142.94 2 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: Weight ~ Time + Group + (Time | ID) + (Time | Group)
Data: RATSL
AIC BIC logLik deviance df.resid
1195.8 1230.7 -586.9 1173.8 165
Scaled residuals:
Min 1Q Median 3Q Max
-3.2547 -0.4134 0.0710 0.5961 2.7906
Random effects:
Groups Name Variance Std.Dev. Corr
ID (Intercept) 1.114e+03 33.3728
Time 6.141e-02 0.2478 -0.16
Group (Intercept) 2.646e+01 5.1439
Time 5.006e-02 0.2237 -1.00
Residual 1.975e+01 4.4436
Number of obs: 176, groups: ID, 16; Group, 3
Fixed effects:
Estimate Std. Error t value
(Intercept) 245.0357 12.1221 20.214
Time 0.6474 0.1456 4.447
Group2 214.5926 20.1799 10.634
Group3 258.9301 20.1799 12.831
Correlation of Fixed Effects:
(Intr) Time Group2
Time -0.276
Group2 -0.555 0.000
Group3 -0.555 0.000 0.333
Data: RATSL
Models:
RATS_ref2: Weight ~ Time + Group + (Time | ID) + (Time | Group)
RATS_ref1: Weight ~ Time + Group + (Time | ID) + (Time | Group)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
RATS_ref2 11 1195.8 1230.7 -586.92 1173.8
RATS_ref1 11 1195.8 1230.7 -586.92 1173.8 0 0 1
[1] "treatment" "subject" "week0" "week1" "week2"
[6] "week3" "week4" "week5" "week6" "week7"
[11] "week8"
'data.frame': 40 obs. of 11 variables:
$ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
$ subject : int 1 2 3 4 5 6 7 8 9 10 ...
$ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
$ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
$ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
$ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
$ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
$ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
$ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
$ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
$ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
treatment subject week0 week1
Min. :1.0 Min. : 1.00 Min. :24.00 Min. :23.00
1st Qu.:1.0 1st Qu.: 5.75 1st Qu.:38.00 1st Qu.:35.00
Median :1.5 Median :10.50 Median :46.00 Median :41.00
Mean :1.5 Mean :10.50 Mean :48.00 Mean :46.33
3rd Qu.:2.0 3rd Qu.:15.25 3rd Qu.:58.25 3rd Qu.:54.25
Max. :2.0 Max. :20.00 Max. :78.00 Max. :95.00
week2 week3 week4 week5
Min. :26.0 Min. :24.00 Min. :20.00 Min. :20.00
1st Qu.:32.0 1st Qu.:29.75 1st Qu.:28.00 1st Qu.:26.00
Median :38.0 Median :36.50 Median :34.50 Median :30.50
Mean :41.7 Mean :39.15 Mean :36.35 Mean :32.55
3rd Qu.:49.0 3rd Qu.:44.50 3rd Qu.:43.00 3rd Qu.:38.00
Max. :75.0 Max. :76.00 Max. :66.00 Max. :64.00
week6 week7 week8
Min. :19.00 Min. :18.0 Min. :20.00
1st Qu.:22.75 1st Qu.:23.0 1st Qu.:22.75
Median :28.50 Median :30.0 Median :28.00
Mean :31.23 Mean :32.2 Mean :31.43
3rd Qu.:37.00 3rd Qu.:38.0 3rd Qu.:35.25
Max. :64.00 Max. :62.0 Max. :75.00
Observations: 360
Variables: 5
$ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
$ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0"...
$ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
$ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
Observations: 360
Variables: 6
$ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
$ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0"...
$ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
$ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ stdbprs <dbl> -0.4245908, 0.7076513, 0.4245908, 0.4953559, 1.69836...
Observations: 18
Variables: 4
$ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2
$ week <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8
$ mean <dbl> 47.00, 46.80, 43.55, 40.90, 36.60, 32.70, 29.70, 29....
$ se <dbl> 4.534468, 5.173708, 4.003617, 3.744626, 3.259534, 2....
Observations: 40
Variables: 3
$ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
$ mean <dbl> 41.500, 43.125, 35.375, 52.625, 50.375, 34.000, 37.1...
Two Sample t-test
data: mean by treatment
t = 0.50707, df = 37, p-value = 0.6151
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.488899 4.150466
sample estimates:
mean in group 1 mean in group 2
32.41750 31.58672
Analysis of Variance Table
Response: mean
Df Sum Sq Mean Sq F value Pr(>F)
baseline 1 1868.07 1868.07 30.1437 3.077e-06 ***
treatment 1 3.45 3.45 0.0557 0.8148
Residuals 37 2292.97 61.97
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
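The summary-measure analysis above (a t-test on the subject means, then an ANCOVA adjusting for the week-0 baseline) can be sketched with synthetic stand-in data; the real analysis uses the BPRS data:

```r
# Synthetic stand-in: 40 subjects, a week-0 baseline, and 8 post-treatment
# measurements per subject, correlated with the baseline.
set.seed(3)
treatment <- factor(rep(1:2, each = 20))
baseline  <- rnorm(40, mean = 48, sd = 12)
post      <- replicate(8, 0.7 * baseline + rnorm(40, sd = 6))

BPRSL8S <- data.frame(treatment = treatment,
                      baseline  = baseline,
                      mean      = rowMeans(post))  # summary measure per subject

t.test(mean ~ treatment, data = BPRSL8S, var.equal = TRUE)  # two-sample t-test
fit <- lm(mean ~ baseline + treatment, data = BPRSL8S)
anova(fit)  # baseline-adjusted comparison of the treatment groups
```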
During the Analysis exercise you will explore both data sets graphically and perform some initial analyses that require various data conversions. Finally, you will build and test a few advanced statistical models and, of course, interpret the results at every phase of the analysis. The Analysis exercise focuses on exploring the data and on performing and interpreting both simple, straightforward methods (based on summary measures of the data) and more advanced methods (linear mixed effects models). To complete the Analysis exercises, include all the code, interpretations and explanations in the R Markdown file chapter6.Rmd.
Implement the analyses of Chapter 8 of MABS using the RATS data. (0-7 points)
Implement the analyses of Chapter 9 of MABS using the BPRS data. (0-8 points)